The mutual information (MI) plot (fig) of the Starr-seq reads (40+40bp surrounding the TSS) shows clear signals along the diagonal, together with additional long-distance signals. However, the underlying binding specificities that give rise to these signals are unclear.

MI plot of the 40+40bp surrounding Starr-seq TSS.
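
For reference, the quantity shown in such a plot can be sketched as follows. This is a minimal illustration and not necessarily the exact procedure used here: it assumes MI is computed between single bases at two positions across equal-length reads, whereas the actual analysis may use kmers. The names `mutual_information` and `mi_matrix` are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(seqs, i, j):
    """MI in bits between the bases at 0-based positions i and j.

    Assumption: all reads have the same length and MI is taken over single bases.
    """
    n = len(seqs)
    pair_counts = Counter((s[i], s[j]) for s in seqs)
    i_counts = Counter(s[i] for s in seqs)
    j_counts = Counter(s[j] for s in seqs)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts
        mi += (c / n) * log2(c * n / (i_counts[a] * j_counts[b]))
    return mi

def mi_matrix(seqs):
    """Symmetric matrix of pairwise MI values; the diagonal is left at 0."""
    length = len(seqs[0])
    matrix = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(i + 1, length):
            matrix[i][j] = matrix[j][i] = mutual_information(seqs, i, j)
    return matrix
```

Position pairs whose entry in this matrix exceeds some chosen threshold would then be the pairs with "considerable MI signal".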


To address this, kmers were counted for all position pairs with considerable MI signal, and the kmer ranks were then used for clustering. The PCA result (fig) shows 3 distinct clusters.
PCA clustering of the kmer ranks for position-pairs
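
The counting and ranking step can be sketched roughly as below. Several details are assumptions made for illustration: the kmer for a position pair (i, j) is taken to be the `m` bases starting at i joined with the `m` bases starting at j, ties in rank are broken alphabetically, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def pair_kmer_counts(seqs, i, j, m=2):
    """Count gapped kmers for one position pair.

    Assumption: the kmer is the m bases starting at i joined with the m bases
    starting at j (0-based positions).
    """
    return Counter(s[i:i + m] + s[j:j + m] for s in seqs)

def rank_vector(counts, m=2, alphabet="ACGT"):
    """Rank of every possible kmer, listed in fixed alphabetical kmer order.

    Rank 1 is the most frequent kmer; ties are broken alphabetically.
    Kmers containing other symbols (e.g. N) are ignored.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=2 * m)]
    order = sorted(kmers, key=lambda k: (-counts.get(k, 0), k))
    rank = {k: r for r, k in enumerate(order, start=1)}
    return [rank[k] for k in kmers]
```

Stacking one such rank vector per position pair gives a matrix (rows are position pairs, columns are kmers) that could then be passed to any standard PCA implementation.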


When the clusters are mapped back onto the MI plot (the following fig), both cluster 2 and cluster 3 correspond to the long-distance MI signals downstream of the TSS.
Clusters mapped back to MI plot


To get a quick overview, an xyplot of the kmer counts between the 3 clusters was generated. The results suggest that cluster 1 consists of a mixture of many weak specificities, while clusters 2 and 3 consist of stronger specificities. An examination of the preferred kmers in clusters 2 and 3 indicates that the two clusters reflect the same binding event, as their kmers are frameshifted versions of each other.

kmer counts between the 3 clusters
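
The frameshift relationship between the preferred kmers of two clusters can be checked programmatically. The sketch below is only an illustration under stated assumptions: kmers are treated as contiguous strings of equal length, "frameshift" is taken to mean that one kmer equals the other offset by a small number of bases, and both function names are hypothetical.

```python
def is_frameshift(a, b, shift=1):
    """True if one kmer is the other offset by `shift` bases (overlap matches)."""
    if len(a) != len(b) or not 0 < shift < len(a):
        return False
    return a[shift:] == b[:-shift] or b[shift:] == a[:-shift]

def shifted_fraction(top_a, top_b, shift=1):
    """Fraction of kmers in top_a with a frameshifted partner in top_b."""
    if not top_a:
        return 0.0
    hits = sum(any(is_frameshift(a, b, shift) for b in top_b) for a in top_a)
    return hits / len(top_a)
```

A fraction close to 1 for the top-ranked kmers of clusters 2 and 3 would be consistent with both clusters reflecting the same binding event.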